Projects and Content


Article: Eager Data Scientist’s Guide to Lazy Evaluation with Dask

Lazy evaluation is at the core of parallelization, but it doesn’t have to be confusing or complicated — in this guide, learn the basic concepts you need to get started! I use Dask to demonstrate, but this is useful for anyone trying to get the hang of parallel computation. Keywords: python, parallelization, dask
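The core idea can be sketched in a few lines of plain Python: wrap each function call in a node of a task graph, and only execute when explicitly asked. This toy `delayed` is a simplified, illustrative stand-in for `dask.delayed`, not Dask's actual internals:

```python
class Delayed:
    """A toy stand-in for dask.delayed: records a call instead of running it."""
    def __init__(self, func, args):
        self.func = func
        self.args = args

    def compute(self):
        # Recursively evaluate any lazy arguments, then run the function.
        resolved = [a.compute() if isinstance(a, Delayed) else a for a in self.args]
        return self.func(*resolved)

def delayed(func):
    """Decorator: calling the function builds a graph node instead of executing."""
    def wrapper(*args):
        return Delayed(func, args)
    return wrapper

@delayed
def double(x):
    return 2 * x

@delayed
def add(a, b):
    return a + b

# Nothing has executed yet -- we've only built a graph of Delayed nodes.
total = add(double(10), double(20))

# Triggering evaluation walks the graph; a real scheduler could run
# independent nodes (the two doubles) in parallel.
print(total.compute())  # 60
```

Because the graph is built before anything runs, a scheduler is free to decide where and when each piece executes — that separation is what makes parallelization possible.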


Opinion: Your Data Scientist Does Not Need a STEM Ph.D.

You should not require or ask for a generic STEM Ph.D. for data scientist candidates. In this blog post, I make a detailed argument against the practice and address common critiques.

Keywords: data science, social science


Tutorial: Combining Dask and PyTorch for Better, Faster Transfer Learning

I use the Stanford Dogs dataset again, this time to demonstrate accelerated transfer learning on ResNet50. The inner workings of PyTorch's multi-machine, multi-GPU training can be confusing, but I have deciphered them for you here.

Keywords: machine learning, python, deep learning


Tutorial: Computer Vision at Scale with Dask and PyTorch

I use the Stanford Dogs dataset to demonstrate accelerating an image classification problem with GPU clusters. If you have been thinking about GPUs but don’t know where to begin, or what they might be good for, I recommend this as a place to start!

Keywords: machine learning, python, deep learning


Article: 3 Ways to Schedule and Execute Python Jobs

Job scheduling is what takes machine learning from academic exercise to production level, delivering real business or project value. In this article I walk through three tools (cron, Airflow, and Prefect) and discuss the pros and cons of each. Depending on your task and circumstances, any one of these tools might be what you need.

Keywords: python, job scheduling, airflow


Article: Make Your Data Move: Using Schedulers with Data Storage to Generate Business Value

In concert with my presentation at ODSC Europe 2020, I wrote a blog post discussing why and how you might use scheduled jobs to make your data infrastructure work better for modeling and machine learning.

Keywords: job scheduling, airflow, data warehousing


Model Behavior Video

I appeared in a video series for Uptake about how modeling industrial failures works - this was really fun, and I hope people will give it a look! I talk about a specific failure of diesel locomotive hardware, and how I solved it while I was working at Uptake.


Radlibs! in R or in Python

I wrote a silly R package called radlibs that allows you to make your own mad libs. Then I wrote a version in Python. Then I added them to CRAN and PyPI. Data science doesn’t always have to be serious. Use install.packages("radlibs") or pip install radlibs to get these packages. Issues and feedback welcome!

Keywords: python, r, packages


Evaluation of R Forwards Package Workshop

I recently co-taught a day-long course for a group of 30 women and gender-nonbinary students on how to write R packages - we had a really good time! I analyzed our pre- and post-workshop surveys in a notebook to check how effective the day was for students.

Keywords: r, data visualization


Kiva Loan Data Analysis

Keywords: r, data visualization


Tutorial: Fun with Real Estate Data

This project is a Kaggle kernel, in which I walk the reader through cleaning and modeling a real estate price dataset using linear modeling, random forests, and gradient boosting (xgboost). My most popular kernel to date! It also produced respectable competition results and was chosen for special recognition by the Kaggle admins. (I won a mug!)

Update: Read the interview I did regarding this project (and the other fabulous winners)! http://blog.kaggle.com/2017/03/29/predicting-house-prices-playground-competition-winning-kernels

Keywords: machine learning, data cleaning


See more projects




kaggle | twitter | github | linkedin | youtube